{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# User Feedback Proof of Concept\n",
    "\n",
    "The following notebook will create a small simulation of retraining our models while removing true-postive (\"anomalies\") to show that that a user feedback system is feasible with our current Anomaly Detection implementation. \n",
    "\n",
    "The goal is to show that the SOM model does have the ability to learn better representations of \"anomalies\" over time as more and more human-expert feedback is included in the training. \n",
    "\n",
    "### 1 - How Does Retraining Effect the SOM? \n",
    "\n",
    "Before getting into the NLP portion of our problem, let's start with a simple, intuitive example with color data to show how different data on subsequent re-trainings can effect our anomaly detection model.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from pandas.io.json import json_normalize\n",
    "import gensim as gs\n",
    "import numpy as np\n",
    "import SOM\n",
    "import imageio\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.animation as anim\n",
    "%matplotlib inline\n",
    "import random\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we start with a randomly initialize map, and as training data we pulls colors at random from a uniform distribution to model the color space. We can see in the animation below that the initial SOM training works, and after 1000 training steps we have a well organized representation of the 3-dim color space we are attempting to map."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "colors_1 = np.random.rand(1000,3)\n",
    "mapp_1, training_animation = SOM.SOM(colors_1, 24,1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "'<img src=\"map_1.gif\" height=\"200\" width=\"200\">'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets say we have some anomaly we whish to identify. Here we will use a black pixel, [0,0,0]. Below we can see that it would recieve an anomly score of 0.301 vs the \n",
    "empirically derived threshold of 0.209. Great so this would be seen as an anomly. (Note: [0,0,0] is an extream value, but serves here as a clear example) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Threshold: 0.2091812488703562\n",
      "Anomaly Score: 0.3013205370166904\n"
     ]
    }
   ],
   "source": [
    "baseline  = []\n",
    "for i in colors_1:\n",
    "    baseline.append(SOM.get_anomaly_score(i,mapp_1))\n",
    "    \n",
    "print(\"Threshold:\", np.std(baseline)*3)\n",
    "print(\"Anomaly Score:\", SOM.get_anomaly_score(np.array([0,0,0]), mapp_1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_______________________________________\n",
    "\n",
    "Now we want to see how retraing ontop of our current map impacts the model. We will do somethig admittedly contrived here, but nonetheless a clear example of how retraing works. We will retrain the model only on white pixles to show the impact new data can have on the model. We can see from the animation below that the these white data points quickly become a large part of the map.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "colors_2 = np.ones([1000,3])\n",
    "mapp_2, training_animation = SOM.SOM(colors_2, 24,1000, mapp_1,1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "'<img src=\"map_2.gif\" height=\"200\" width=\"200\">'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, taking [0,0,0] as our target anomlay. We can see that after retraining on only white examples, the anomaly score is greater than it was before. Furthermore, the score is necessarily equal to or greater than before retraining, as lighter colors now contribute more to underlying color space we are modeling. \n",
    "\n",
    "This demonstrates how retraining on the right data (based on human-expert input) can help to gradually improve our models perfromance.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Threshold: 0.0\n",
      "Anomaly Score: 0.4592001692275722\n"
     ]
    }
   ],
   "source": [
    "baseline  = []\n",
    "for i in colors_2:\n",
    "    baseline.append(SOM.get_anomaly_score(i,mapp_2))\n",
    "    \n",
    "print(\"Threshold:\", round(np.std(baseline)*3))\n",
    "print(\"Anomaly Score:\", SOM.get_anomaly_score(np.array([0,0,0]), mapp_2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_____________________________\n",
    "\n",
    "Now lets say we retrain a second time. How will this impact our model? This time we will train on a bright green pixel [0,1,0]. We can see from the animation below that our map is now representative of both the white and the green examples its seen most recently, but still retains a corner of the over all color spectrum.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "colors_3 = np.ones([1000,3]) * np.array([0,1,0])\n",
    "mapp_3, training_animation = SOM.SOM(colors_3, 24,1000, mapp_2,2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "'<img src=\"map_3.gif\" height=\"200\" width=\"200\">'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, agin we get a fairly high anomlay score (0.4). However, given [0,1,0] is closer to [0,0,0] than [1,1,1], in this case we are not guaranteed to get a higher score. But agin we can get an intuition for how the retraining works. Since the anomaly score measures the distance from the test vector to the node on the map which is closest, the new light green data impacted the rest of the map and moved which ever node was closest to [0,0,0] in the previous training session a little closer to [0,1,0] thereby increasing the black pixles anomlay score.      "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Threshold: 0.0\n",
      "Anomaly Score: 0.6867127750863574\n"
     ]
    }
   ],
   "source": [
    "baseline  = []\n",
    "for i in colors_3:\n",
    "    baseline.append(SOM.get_anomaly_score(i,mapp_3))\n",
    "    \n",
    "print(\"Threshold:\", round(np.std(baseline)*3))\n",
    "print(\"Anomaly Score:\", SOM.get_anomaly_score(np.array([0,0,0]), mapp_3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "____________________________\n",
    "\n",
    "Finally, we want to show how on a 3rd re-traning round with less uniform data (the same kind of randomly generated colors from our intial trainig), will impact our map. From the animation below, we can see that the new data quickly overtakes the large green and white spots, however, comparing this map to the intial trainig, we can see that it, generally, favors lighter, greener colors, indicating that the information from the previous training rounds reamains encodeded in our map  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "colors_4 = np.random.rand(1000,3)\n",
    "mapp_4, training_animation = SOM.SOM(colors_4, 24,1000, mapp_3, 3 )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "'<img src=\"map_4.gif\" height=\"200\" width=\"200\">'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, however, given the nature of the input data, we are back to map that is far similar, with simialar anomlay score and threshold as our original map. This is all to demonstrate the impact different traning data can have on our models ability to detect anomalies as it evolves over multiple retraining.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Threshold: 0.19595867797999827\n",
      "Anomaly Score: 0.3960642414166838\n"
     ]
    }
   ],
   "source": [
    "baseline  = []\n",
    "for i in colors_4:\n",
    "    baseline.append(SOM.get_anomaly_score(i,mapp_4))\n",
    "    \n",
    "print(\"Threshold:\", np.std(baseline)*3)\n",
    "print(\"Anomaly Score:\", SOM.get_anomaly_score(np.array([0,0,0]), mapp_4))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
